REPRESENTATION LEARNING FOR ACTION RECOGNITION
The objective of this research work is to develop discriminative representations for human
actions. The motivation stems from the many issues encountered while capturing actions in videos, such as intra-action variations (due to actors, viewpoints, and duration), inter-action similarity, background motion, and occlusion of actors. Hence, obtaining a representation which can address all the variations within the same action while maintaining discrimination from other actions is a challenging task. In the literature, actions have been represented using either low-level or high-level features. Low-level features describe
the motion and appearance in small spatio-temporal volumes extracted from a video. Due
to the limited space-time volume used for extracting low-level features, they are not able
to account for viewpoint and actor variations or variable-length actions. On the other hand,
high-level features handle variations in actors, viewpoints, and duration but the resulting
representation is often high-dimensional which introduces the curse of dimensionality. In
this thesis, we propose new representations for describing actions by combining the advantages
of both low-level and high-level features. Specifically, we investigate various linear
and non-linear decomposition techniques to extract meaningful attributes in both high-level
and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build
action-specific dictionaries. Each dictionary retains only the discriminative information
for a particular action and hence reduces inter-action similarity. Then, a sparsity-based
classification method is proposed to classify the low-rank representation of clips obtained
using these dictionaries. We show that this representation based on dictionary learning improves
the classification performance across actions. Moreover, some actions involve rapid body deformations that hinder the extraction of local features from body movements.
Hence, we propose to use a dictionary which is trained on convolutional neural network
(CNN) features of the human body in various poses to reliably identify actors from the
background. Particularly, we demonstrate the efficacy of sparse representation in the identification
of the human body under rapid and substantial deformation.
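To make the dictionary-based classification scheme concrete, the sketch below shows the general pattern under stated assumptions: one dictionary is learned per action class from that class's descriptors, and a test clip is assigned to the class whose dictionary reconstructs its descriptor with the smallest residual. It uses scikit-learn for illustration; the atom count, sparsity penalty, and feature extraction are placeholders, not the thesis's exact formulation.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning, sparse_encode

# Learn one dictionary per action class from that class's training descriptors.
def learn_class_dictionaries(features_by_class, n_atoms=100, alpha=1.0):
    dictionaries = {}
    for label, X in features_by_class.items():   # X: (n_samples, n_dims)
        dl = MiniBatchDictionaryLearning(n_components=n_atoms,
                                         alpha=alpha, random_state=0)
        dl.fit(X)
        dictionaries[label] = dl.components_     # (n_atoms, n_dims)
    return dictionaries

# Classify a clip descriptor by the class dictionary that reconstructs it best.
def classify(x, dictionaries, alpha=1.0):
    best_label, best_err = None, np.inf
    for label, D in dictionaries.items():
        code = sparse_encode(x[None, :], D, alpha=alpha)  # sparse coefficients
        err = np.linalg.norm(x - code @ D)                # reconstruction residual
        if err < best_err:
            best_label, best_err = label, err
    return best_label
```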
In the first two approaches, sparsity-based representation is developed to improve discriminability
using class-specific dictionaries that utilize action labels. However, developing
an unsupervised representation of actions is more beneficial as it can be used to both
recognize similar actions and localize actions. We propose to exploit inter-action similarity
to train a universal attribute model (UAM) in order to learn action attributes (common and
distinct) implicitly across all the actions. Using maximum a posteriori (MAP) adaptation,
a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains
redundant attributes of all other actions, we use factor analysis to extract a novel low-dimensional action-vector representation for each clip. Action-vectors are shown to suppress background motion and highlight actions of interest in both trimmed and untrimmed clips, which contributes to action recognition without the help of any classifiers.
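A minimal sketch of this pipeline, assuming scikit-learn-style components: a large diagonal-covariance GMM serves as the UAM, relevance-MAP adaptation of its means yields the super action-vector (SAV) for a clip, and factor analysis compresses SAVs into action-vectors. The mixture count, relevance factor, and number of factors are illustrative placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import FactorAnalysis

# Universal attribute model (UAM): a large GMM trained on low-level
# descriptors pooled from all training clips.
def train_uam(pooled_descriptors, n_mixtures=256):
    uam = GaussianMixture(n_components=n_mixtures,
                          covariance_type='diag', random_state=0)
    return uam.fit(pooled_descriptors)

# Relevance-MAP adaptation of the UAM means to one clip's descriptors,
# stacked into a super action-vector (SAV).
def super_action_vector(uam, clip_descriptors, relevance=16.0):
    resp = uam.predict_proba(clip_descriptors)   # (n_desc, n_mixtures)
    n_k = resp.sum(axis=0)                       # soft counts per mixture
    f_k = resp.T @ clip_descriptors              # first-order statistics
    alpha = (n_k / (n_k + relevance))[:, None]   # adaptation weights
    means = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) \
            + (1.0 - alpha) * uam.means_
    return means.ravel()                         # concatenated adapted means

# Factor analysis projects redundant SAVs to low-dimensional action-vectors.
def to_action_vectors(savs, n_factors=64):
    fa = FactorAnalysis(n_components=n_factors, random_state=0)
    return fa.fit_transform(savs)                # savs: (n_clips, sav_dim)
```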
It is observed during our experiments that action-vectors cannot effectively discriminate
between actions which are visually similar to each other. Hence, we subject action-vectors
to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic
LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complementary
information across action-vectors using different local features followed by discriminative
embedding provides the best classification performance. Further, we explore
non-linear embedding of action-vectors using Siamese networks, especially for fine-grained
action recognition. A visualization of the hidden layer output in Siamese networks shows
its ability to effectively separate visually similar actions. This leads to better classification
performance than linear embedding on fine-grained action recognition.
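As an illustration of the linear variant, the sketch below embeds action-vectors with scikit-learn's LDA and classifies in the embedded space by nearest class mean; the PLDA and Siamese variants are not shown, and the classifier choice here is illustrative rather than the thesis's exact setup.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Supervised linear embedding: LDA maps action-vectors onto directions that
# maximize between-class scatter relative to within-class scatter.
def lda_embed(action_vectors, labels, n_dims=None):
    lda = LinearDiscriminantAnalysis(n_components=n_dims)
    embedded = lda.fit_transform(action_vectors, labels)
    return lda, embedded

# Nearest-class-mean classification in the embedded space (illustrative).
def classify(lda, embedded, labels, query):
    # labels: np.ndarray of class ids aligned with rows of `embedded`
    z = lda.transform(query[None, :])[0]
    means = {c: embedded[labels == c].mean(axis=0) for c in np.unique(labels)}
    return min(means, key=lambda c: np.linalg.norm(z - means[c]))
```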
All of the above approaches are presented on large unconstrained datasets with hundreds
of examples per action. However, actions in surveillance videos like snatch thefts are
difficult to model because of the wide variety of scenarios in which they occur and the very few labeled examples available. Hence, we propose to utilize the universal attribute model (UAM)
trained on large action datasets to represent such actions. Specifically, we show that there
are similarities between certain actions in the large datasets with snatch thefts which help
in extracting a representation for snatch thefts using the attributes from the UAM. This
representation is shown to be effective in distinguishing snatch thefts from regular actions
with high accuracy. In summary, this thesis proposes both supervised and unsupervised approaches for representing
actions which provide better discrimination than existing representations. The
first approach presents a dictionary learning based sparse representation for effective discrimination
of actions. Also, we propose a sparse representation for the human body based
on dictionaries in order to recognize actions with rapid body deformations. In the next
approach, a low-dimensional representation called action-vector for unsupervised action
recognition is presented. Further, linear and non-linear embedding of action-vectors is
proposed for addressing inter-action similarity and fine-grained action recognition, respectively.
Finally, we propose a representation for locating snatch thefts among thousands of
regular interactions in surveillance videos.
Determining t in t-closeness using Multiple Sensitive Attributes
Many government agencies and other organizations often need to publish microdata, e.g., medical data or census data, for research and other purposes. Typically, such data is stored in a table, and each record (row) corresponds to one individual. Each record has a number of attributes, which can be divided into the following three categories: (1) attributes that clearly identify individuals, known as explicit identifiers, such as Social Security Number, Address, and Name; (2) attributes whose values, when taken together, can potentially identify an individual, known as quasi-identifiers (QI), e.g., Zip-code, Birthdate, and Gender; (3) attributes that are considered sensitive, such as Disease and Salary, known as Sensitive Attributes (SA). When releasing microdata, it is necessary to prevent the sensitive information of the individuals from being disclosed. Therefore, the objective is to limit the disclosure risk to an acceptable level while maximizing the utility. This can be achieved by anonymizing the data before release. Models like k-anonymity (to prevent linkage attacks), l-diversity (to prevent homogeneity and background knowledge attacks), and t-closeness (to prevent skewness and similarity attacks) have been proposed over the years, and are collectively known as Privacy Preserving Data Publishing models. Here, a novel way of determining t and applying t-closeness for multiple sensitive attributes is presented. The only information required beforehand is the partitioning classes of the Sensitive Attribute(s). Since achieving t-closeness is an NP-hard problem, knowing the value of t in advance greatly reduces the time required for anonymizing with various values of t. The rationale for the measure used to determine t is discussed with conclusive proof, and the speedup achieved is also shown.
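The paper's specific measure is not reproduced here, but the sketch below illustrates one natural reading of "determining t" under the standard definition: for a categorical sensitive attribute with the equal-distance ground metric, the EMD between two distributions reduces to half their L1 distance, and the smallest feasible t for a given partition is the maximum class-to-table distance. All function names are hypothetical.

```python
from collections import Counter

# t-closeness requires the sensitive-attribute (SA) distribution in every
# equivalence class to be within distance t of the whole-table distribution.
def distribution(values):
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items()}

# For categorical SAs with the equal-distance ground metric, the EMD
# reduces to half the L1 distance between the two distributions.
def emd_categorical(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in support)

# The smallest t for which a partition satisfies t-closeness is the
# maximum class-to-table distance; one natural way to "determine t".
def smallest_t(table_sa_values, equivalence_classes):
    overall = distribution(table_sa_values)
    return max(emd_categorical(distribution(cls), overall)
               for cls in equivalence_classes)
```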
Interaction Visual Transformer for Egocentric Action Anticipation
Human-object interaction is one of the most important visual cues that has
not been explored for egocentric action anticipation. We propose a novel
Transformer variant to model interactions by computing the change in the
appearance of objects and human hands due to the execution of the actions and
use those changes to refine the video representation. Specifically, we model
interactions between hands and objects using Spatial Cross-Attention (SCA) and
further infuse contextual information using Trajectory Cross-Attention to
obtain environment-refined interaction tokens. Using these tokens, we construct
an interaction-centric video representation for action anticipation. We term
our model InAViT which achieves state-of-the-art action anticipation
performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual transformer-based methods, including those based on object-centric video representations. On the EK100 evaluation server,
InAViT is the top-performing method on the public leaderboard (at the time of
submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
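A minimal sketch of the cross-attention pattern the SCA module is built on, assuming PyTorch: hand tokens act as queries over object tokens so that each hand token is refined by the objects it interacts with. The token dimensions and the residual-plus-norm arrangement are assumptions, not the published InAViT architecture.

```python
import torch
import torch.nn as nn

# Hand tokens query object tokens; each hand token is refined by the
# objects it attends to (in the spirit of Spatial Cross-Attention).
class SpatialCrossAttention(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_tokens, object_tokens):
        # hand_tokens:   (batch, n_hands, dim)
        # object_tokens: (batch, n_objects, dim)
        refined, _ = self.attn(query=hand_tokens,
                               key=object_tokens,
                               value=object_tokens)
        return self.norm(hand_tokens + refined)   # residual + norm

# Usage: refine 2 hand tokens against 4 object tokens.
sca = SpatialCrossAttention()
hands, objects = torch.randn(1, 2, 768), torch.randn(1, 4, 768)
interaction_tokens = sca(hands, objects)          # (1, 2, 768)
```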
ClipSitu: Effectively Leveraging CLIP for Conditional Predictions in Situation Recognition
Situation Recognition is the task of generating a structured summary of what
is happening in an image using an activity verb and the semantic roles played
by actors and objects. In this task, the same activity verb can describe a
diverse set of situations as well as the same actor or object category can play
a diverse set of semantic roles depending on the situation depicted in the
image. Hence, a situation recognition model needs to understand the context of
the image and the visual-linguistic meaning of semantic roles. Therefore, we
leverage the CLIP foundational model that has learned the context of images via
language descriptions. We show that deeper-and-wider multi-layer perceptron (MLP) blocks obtain noteworthy results for the situation recognition task by using CLIP image and text embedding features, even outperforming the state-of-the-art CoFormer, a Transformer-based model, thanks to the external
implicit visual-linguistic knowledge encapsulated by CLIP and the expressive
power of modern MLP block designs. Motivated by this, we design a
cross-attention-based Transformer using CLIP visual tokens that model the
relation between textual roles and visual entities. Our cross-attention-based
Transformer, known as ClipSitu XTF, outperforms the existing state-of-the-art by a large margin of 14.1% on semantic role labelling (value) for top-1 accuracy on the imSitu dataset. Similarly, our ClipSitu XTF obtains state-of-the-art situation localization performance. We will make the code publicly available. Comment: State-of-the-art results on Grounded Situation Recognition
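A sketch of the conditional-MLP idea under stated assumptions: the CLIP image embedding is concatenated with the verb and role text embeddings and passed through a deeper-and-wider MLP to score candidate nouns. The embedding width, depth, and noun vocabulary size are placeholders, and this is not the released ClipSitu code.

```python
import torch
import torch.nn as nn

# Predict the noun filling a semantic role, conditioned on CLIP embeddings
# of the image, the activity verb, and the role (all placeholder sizes).
class ClipSituMLP(nn.Module):
    def __init__(self, clip_dim=512, hidden=2048, n_nouns=10000, depth=3):
        super().__init__()
        layers, in_dim = [], 3 * clip_dim            # image + verb + role
        for _ in range(depth):
            layers += [nn.Linear(in_dim, hidden), nn.GELU(), nn.Dropout(0.1)]
            in_dim = hidden
        layers.append(nn.Linear(in_dim, n_nouns))
        self.mlp = nn.Sequential(*layers)

    def forward(self, img_emb, verb_emb, role_emb):
        x = torch.cat([img_emb, verb_emb, role_emb], dim=-1)
        return self.mlp(x)                           # logits over noun classes

# Usage with dummy CLIP-sized embeddings.
model = ClipSituMLP()
logits = model(torch.randn(1, 512), torch.randn(1, 512), torch.randn(1, 512))
```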
Defining Traffic States using Spatio-temporal Traffic Graphs
Intersections are one of the main sources of congestion and hence, it is
important to understand traffic behavior at intersections. Particularly, in
developing countries with high vehicle density, mixed traffic type, and
lane-less driving behavior, it is difficult to distinguish between congested
and normal traffic behavior. In this work, we propose a way to understand the
traffic state of smaller spatial regions at intersections using traffic graphs.
The way these traffic graphs evolve over time reveals different traffic states: a) congestion is forming (clumping), b) congestion is dispersing (unclumping), or c) traffic is flowing normally (neutral). We train a
spatio-temporal deep network to identify these changes. Also, we introduce a
large dataset called EyeonTraffic (EoT) containing 3 hours of aerial videos
collected at 3 busy intersections in Ahmedabad, India. Our experiments on the
EoT dataset show that the traffic graphs can help in correctly identifying
congestion-prone behavior in different spatial regions of an intersection.Comment: Accepted in 23rd IEEE International Conference on Intelligent
Transportation Systems September 20 to 23, 2020. 6 pages, 6 figure
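As a sketch of what a per-frame traffic graph might look like, the snippet below treats detected vehicles as nodes and links pairs closer than a distance threshold; a clip then becomes a sequence of adjacency matrices whose evolution a spatio-temporal network can classify as clumping, unclumping, or neutral. The proximity rule and threshold are assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

# One traffic graph per frame: vehicles are nodes, and an edge connects two
# vehicles closer than a distance threshold (an assumed construction).
def traffic_graph(positions, threshold=20.0):
    # positions: (n_vehicles, 2) vehicle centroids from the aerial view
    diffs = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    adjacency = (dist < threshold).astype(np.float32)
    np.fill_diagonal(adjacency, 0.0)                 # no self-loops
    return adjacency

# A clip becomes a sequence of adjacency matrices; their evolution over
# time is what the spatio-temporal network classifies.
frames = [np.random.rand(12, 2) * 100 for _ in range(16)]
graph_sequence = [traffic_graph(p) for p in frames]
```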
Snatch theft detection in unconstrained surveillance videos using action attribute modelling
In a city with hundreds of cameras and thousands of interactions daily among people, manually identifying crimes like chain and purse snatching is a tedious and challenging task. Snatch thefts are complex actions containing attributes like walking, running etc. which are affected by actor and view variations. To capture the variation in these attributes in diverse scenarios, we propose to model snatch thefts using a Gaussian mixture model (GMM) with a large number of mixtures, known as the universal attribute model (UAM). However, the number of snatch thefts typically recorded in surveillance videos is not sufficient to train the parameters of the UAM. Hence, we use large human action datasets such as UCF101 and HMDB51 to train the UAM, as many of the actions in these datasets share attributes with snatch thefts. Then, a super-vector representation for each snatch theft clip is obtained using maximum a posteriori (MAP) adaptation of the universal attribute model. However, super-vectors are high-dimensional and contain many redundant attributes which do not contribute to snatch thefts. So, we propose to use factor analysis to obtain a low-dimensional representation called an action-vector that contains only the relevant attributes. For evaluation, we introduce a video dataset called Snatch 1.0, created from many hours of surveillance footage obtained from different traffic cameras placed in the city of Hyderabad, India. We show that, using action-vectors, snatch thefts can be identified better than with existing state-of-the-art feature representations.
Sparsity-inducing dictionaries for effective action classification
Action recognition in unconstrained videos is one of the most important challenges in computer vision. In this paper, we propose sparsity-inducing dictionaries as an effective representation for action classification in videos. We demonstrate that features obtained from a sparsity-based representation provide discriminative information useful for the classification of action videos into various action classes. We show that the constructed dictionaries are distinct for a large number of action classes, resulting in a significant improvement in classification accuracy on the HMDB51 dataset. We further demonstrate the efficacy of dictionaries and sparsity-based classification on other large action video datasets such as UCF50.